Linear Regression

Linear Regression is a regression model, which means it predicts a quantitative value. It answers the question, "How much?"

Consider a housing price challenge: the price of a house varies with its size, i.e. its area in square feet.


In [5]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import Image
from sklearn.linear_model import LinearRegression
import warnings

warnings.filterwarnings(action='ignore')

In [79]:
house_price = {"size":[1000,1500,2000,1200,1300],"price": [50000, 80000, 120000, 55000, 65000]}

In [80]:
plt.scatter(house_price["size"],house_price["price"])


Out[80]:
<matplotlib.collections.PathCollection at 0x1a20d2dc50>

Housing Prices

We can see that the points increase roughly linearly, i.e. as size increases, price increases. Now, to find the price for 1800 sq. ft., we can fit a line through these points. The question is how to fit that line: should it pass exactly through the points, or is an approximate line enough?

If we fit a line exactly through the points, we get exact predictions (house prices) for those five points (house sizes) alone. That means we are overfitting our model to these five points, and the model lacks generalization.

How does generalization work?

First, we draw a random line and look at the distance between each point and the line. If the line is far from the points, we move it so that the overall distance between the line and the points becomes smaller. It takes a few iterations to find the optimal line, where the overall distance between the points and the line is at a minimum.


In [81]:
Image("/Users/mayurjain/Desktop/cv_images/fit_line.png")


Out[81]:

We know the line equation y = mx + c, where m is the slope and c is the y-intercept. If m increases, the line tilts upward; if m decreases, the line tilts downward. At x = 0, we have y = c, so changing c moves the line up and down along the y-axis (where x = 0).

If we change the slope by the raw values of the points, the line tends to overshoot, making it difficult to reduce the distance between the points and the line. So we multiply the updates to the slope m and intercept c by a learning rate. The learning rate lets us take small steps, moving the line gradually closer to the points.

So the updated line becomes y = (m + point_x * alpha) * x + (c + alpha), where alpha is the learning rate and point_x is the x-coordinate of the point we are moving towards.
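
As a minimal sketch of this update rule (the starting values, iteration count, and learning rate below are illustrative assumptions, not values from the notebook), each pass nudges the line towards points above it and away from points below it:

In [ ]:
# A sketch of the update rule above (sometimes called the "absolute trick").
# alpha, m, and c are illustrative choices, not fitted values.
alpha = 0.0001        # learning rate: keeps each step small
m, c = 0.0, 0.0       # start from the flat line y = 0

for _ in range(1000):
    for point_x, point_y in zip(house_price["size"], house_price["price"]):
        y_pred = m * point_x + c
        if point_y > y_pred:      # point above the line: move the line up
            m += alpha * point_x
            c += alpha
        else:                     # point below the line: move the line down
            m -= alpha * point_x
            c -= alpha

print(m, c)   # slope and intercept after the updates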

The mathematical definition of slope is m = (change in y / change in x) = dy/dx. Now, how is this change brought into action? To perform this derivative, or gradient, update we use Gradient Descent.


In [82]:
Image("/Users/mayurjain/Desktop/cv_images/gradient_descent.png")


Out[82]:

As mentioned above, the slope = change in error / change in weight. Here, error refers to the distance between a point and the line, so reducing the distance means reducing the error.
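
Here is a hedged sketch of gradient descent on this data (the scaling step, learning rate, and iteration count are assumptions for the demo): we compute the gradient of the mean squared error with respect to m and c, and step in the opposite direction.

In [ ]:
# A sketch of gradient descent on the line y = m*x + c, minimizing
# mean squared error. The learning rate and iteration count are illustrative.
x = np.array(house_price["size"], dtype=float)
y = np.array(house_price["price"], dtype=float)

x_scaled = x / x.max()    # scale x so the gradient steps stay stable

m, c = 0.0, 0.0
alpha = 0.1               # learning rate
for _ in range(5000):
    y_hat = m * x_scaled + c
    error = y_hat - y
    m -= alpha * 2 * np.mean(error * x_scaled)   # d(MSE)/dm
    c -= alpha * 2 * np.mean(error)              # d(MSE)/dc

print(m / x.max(), c)     # slope and intercept on the original scale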

Error Functions

  • Mean Absolute Error
  • Mean Squared Error

In [83]:
Image("/Users/mayurjain/Desktop/cv_images/MAE.png")


Out[83]:

Mean Absolute Error

It is the absolute value of the difference between the actual point (x, y) and the estimated point (x, y_hat), where y_hat is the predicted output. We sum each sample's absolute value |y - y_hat| and divide by the total number of samples. We use the absolute value because a point may lie below the line, producing a negative difference that would cancel out positive differences and give an inaccurate result.
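
A quick sketch of MAE (the slope and intercept below are illustrative values, not fitted ones):

In [ ]:
# Mean Absolute Error: the average of |y - y_hat| over all samples.
m, c = 70.0, -25000.0     # illustrative slope and intercept
x = np.array(house_price["size"], dtype=float)
y = np.array(house_price["price"], dtype=float)
y_hat = m * x + c         # predicted prices

mae = np.mean(np.abs(y - y_hat))
print(mae)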

Mean Squared Error

It is similar to MAE, except that instead of taking the absolute value, we square the difference between y and y_hat.
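
Reusing the y and y_hat from the MAE cell above, the same sketch for MSE:

In [ ]:
# Mean Squared Error: the average of (y - y_hat) ** 2 over all samples,
# reusing y and y_hat from the MAE cell above.
mse = np.mean((y - y_hat) ** 2)
print(mse)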


In [84]:
Image("/users/mayurjain/Desktop/cv_images/MSE_1.png")


Out[84]:

In [85]:
Image("/users/mayurjain/Desktop/cv_images/MSE_2.png")


Out[85]:

As we descend along the gradient, the mean of the squared differences between y and y_hat is reduced.
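
To tie this back to the opening question about an 1800 sq. ft. house, here is a minimal sketch using the LinearRegression class imported at the top; scikit-learn carries out the fitting for us.

In [ ]:
# Fit scikit-learn's LinearRegression to the house data.
X = np.array(house_price["size"]).reshape(-1, 1)   # sklearn expects a 2-D feature array
y = np.array(house_price["price"])

model = LinearRegression()
model.fit(X, y)

print(model.coef_, model.intercept_)   # learned slope m and intercept c
print(model.predict([[1800]]))         # estimated price for an 1800 sq. ft. house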

Linear Regression Warnings

Linear Regression Works Best When the Data is Linear

Linear regression produces a straight-line model from the training data. If the relationship in the training data is not really linear, you'll need to either make adjustments (transform the training data), add features, or use another kind of model, as sketched below.
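
As a hedged sketch of the "add features" option (the curved data here is invented for illustration), polynomial features let a linear model fit a curved relationship:

In [ ]:
from sklearn.preprocessing import PolynomialFeatures

# Invented non-linear data: y grows with the square of x.
X_curved = np.arange(1, 11).reshape(-1, 1)
y_curved = (X_curved ** 2).ravel()

# Expand x into [1, x, x^2] so a straight-line model can capture the curve.
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X_curved)

poly_model = LinearRegression().fit(X_poly, y_curved)
print(poly_model.predict(poly.transform([[12]])))   # close to 144 (= 12 ** 2)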

Linear Regression is Sensitive to Outliers

Linear regression tries to find a 'best fit' line among the training data. If the dataset has some outlying extreme values that don't fit a general pattern, they can have a surprisingly large effect.
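
A small sketch of this effect (the outlier point is invented for illustration): refitting after adding one extreme point noticeably changes the learned slope.

In [ ]:
# Compare the fitted slope with and without one extreme point.
X = np.array(house_price["size"]).reshape(-1, 1)
y = np.array(house_price["price"])

clean_fit = LinearRegression().fit(X, y)

# Add one invented outlier: a small house with a huge price.
X_out = np.vstack([X, [[900]]])
y_out = np.append(y, 500_000)
outlier_fit = LinearRegression().fit(X_out, y_out)

print(clean_fit.coef_, outlier_fit.coef_)   # the slope shifts dramatically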

